
Understanding and Improving Communication Performance in Multi-node LLM Inference

Prajwal Singhania (University of Maryland), Siddharth Singh (University of Maryland), Lannie Dalton Hough (University of Maryland), Akarsh Srivastava (University of Maryland), Harshitha Menon (Lawrence Livermore National Laboratory), Charles Fredrick Jekel (Lawrence Livermore National Laboratory), Abhinav Bhatele (University of Maryland)

System Optimization & Efficiency

A detailed performance study of multi-node distributed LLM inference on GPU clusters that characterizes communication bottlenecks across model-parallel strategies—tensor, pipeline, and sequence parallelism—at scale. The results identify the dominant sources of inter-node communication overhead and provide optimization strategies validated on state-of-the-art inference engines.

Presentation

Talk

Paper Session 3: Systems Efficiency

Wednesday, May 27 · 3:40 PM – 3:50 PM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey


Abstract

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9–3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
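To make the term "recursive doubling" in the abstract concrete, the following is a minimal single-process simulation of a textbook recursive-doubling sum all-reduce. It is only an illustrative sketch, not the NVRAR implementation: it does not model the hierarchical (intra-node versus inter-node) structure, NVSHMEM, or GPUs, and the function name `recursive_doubling_allreduce` and the list-of-lists buffer layout are assumptions made for this example. It also assumes the number of ranks is a power of two.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum all-reduce across len(values) ranks.

    `values[r]` is the local buffer of rank r (hypothetical layout).
    Returns (final buffers per rank, number of exchange steps).
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two rank count")

    bufs = [list(v) for v in values]
    dist = 1
    steps = 0
    while dist < p:
        # In each step, rank r exchanges its whole buffer with rank r XOR dist
        # and both sides add the received data to their own.
        next_bufs = []
        for rank in range(p):
            partner = rank ^ dist
            next_bufs.append([a + b for a, b in zip(bufs[rank], bufs[partner])])
        bufs = next_bufs
        dist *= 2  # the partner distance doubles every step
        steps += 1
    return bufs, steps


if __name__ == "__main__":
    local = [[r, 2 * r] for r in range(8)]  # 8 simulated ranks
    result, steps = recursive_doubling_allreduce(local)
    print(result[0], steps)
```

With p ranks, this pattern finishes in log2(p) exchange steps, each moving the full buffer. That trades extra bandwidth for fewer steps, which is why recursive doubling is generally associated with small, latency-bound messages such as the 128 KB to 2 MB range named in the abstract.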

Artifacts & Links

Paper (ACM Digital Library)
